Microsoft Foundry Blog

10 MIN READ

Meta’s next generation model, Llama 3.1 405B is now available on Azure AI

Microsoft

Jul 23, 2024

In collaboration with Meta, Microsoft is announcing Llama 3.1 405B available today through Azure AI’s Models-as-a-Service as a serverless API endpoint. The latest fine-tuned versions of Llama 3.1 8B and Llama 3.1 70B are also now available on Azure AI Foundry model catalog. Developers can rapidly try, evaluate and provision these models in Azure AI Foundry using popular LLM developer tools like Azure AI prompt flow, OpenAI, LangChain, LiteLLM, CLI with curl and Python web requests.

We are also announcing Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, Llama Guard 3 8B and Prompt Guard now available in Azure AI through managed compute deployments.

We are thrilled to be one of Meta’s launch partners in this innovative release for advanced synthetic data generation and distillation where 405B-Instruct is used as a teacher model and 8B-Instruct/70B-Instruct models serving as student models. Enterprises and developers can now streamline the development process while maintaining performance and cost efficiency, leveraging AI to build complex applications for a variety industry task-specific use case.

The Growing Need for Specialized AI Models

Large Language Models (LLMs) are known for their impressive few-shot learning and reasoning abilities. However, for applications that need tailored responses, the comprehensive capabilities of larger models can be excessive. This over-qualification leads to high computational demands and increased latency, making them less suitable for specific-use scenarios.

As such, customers can leverage powerful large models as a teacher model to train small student through distillation, resulting in tailored models ready for use in domain-specific use cases:

Customer Support: Automated systems need to provide accurate, relevant responses to diverse customer queries.

Healthcare: AI-driven diagnostics and patient interaction require precise, context-sensitive information.

Legal Services: Document drafting, and legal advice must be tailored to specific legal scenarios and client needs.

Education: Personalized tutoring systems that cater to individual learning paces and styles.

Finance: Tailored financial advice and portfolio management based on individual client profiles and market conditions.

Introducing Llama 3.1 405B on Azure AI

According to Meta, Llama 3.1 405B is expected to be the largest and most powerful open-source model available, built with delivering specific capabilities to developers:

Synthetic data generation and distillation

A significant hurdle in customizing smaller models is the substantial computational effort required to annotate vast datasets. Here, the Llama 3.1 405B Instruct synthetic data generation capability through distillation becomes invaluable.

Distillation involves using a teacher model – like Llama 3.1 405B - to generate synthetic data for fine-tuning student models – like Llama 3.1 8B and Llama 3.1 70B. The generated data allows developers to build task-specific fine-tuned models when solving domain-specific industry use cases, as mentioned above. By enabling data generation and distillation, Llama 3.1 405B can streamline this process, reducing the time and resources spent on data annotation while maintaining high performance. For example, a smaller Llama 3.1 405B model or other efficient models like the Phi-3 series can be fine-tuned with this synthetic data, ensuring that the end application is trained on relevant data, thereby being robust and responsive without the computational and organizational overhead of a full-scale LLM. We also provide distillation based on Chain Of Thought reasoning and have seen significant improvement in accuracy on the Meta-Llama-3.1-8B-instruct model for NLI task.

Direct model usage

Using a combination of quantization, speculative decoding or other optimization techniques, Llama 3.1 405B will be a highly advanced model for both batch and online inference.

Domain specific model

Llama 3.1 405B can serve as a base model for specialized continual pre-training or fine-tuning in a specific industry domain.

Open models Accessibility and Enterprise-Grade Reliability

The potential of the Llama 3.1 405B is magnified by its availability under an open license, allowing unrestricted access for commercial and research purposes. This openness encourages widespread adoption and innovation, offering developers the freedom to experiment and tailor solutions to their specific needs without the overhead of licensing restrictions.

Exploring the Llama-3.1 models and benefits on Azure AI

The Llama 3.1 collection of LLM includes pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes, supporting long context lengths (128k) and optimized for inference with grouped query attention (GQA). These models are designed for multilingual (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) dialogue use cases. According to Meta, these models outperform many open-source chat models on industry benchmarks.

The Llama 3.1 collection of models employs an optimized transformer architecture and uses supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for alignment with human preferences. The instruction-tuned text-only models are particularly effective for tool use, supporting zero-shot tool use and specific capabilities like search, image generation, code execution, and mathematical reasoning. For more information, see the Meta Blog post.

Why Azure AI for Meta Llama 3.1?

Developers using Llama- 3.1 models can work seamlessly with tools in Azure AI Foundry, such as Azure AI Content Safety, Azure AI Search, and prompt flow to enhance ethical and effective AI practices.

From today, customers can access the following models through serverless APIs:

Meta- Llama-3.1-405B-Instruct

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-8B-Instruct

Through managed compute deployment, customers can provision the following models using their available quota:

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-70B

Meta-Llama-3.1-8B-Instruct

Meta-Llama-3.1-8B

Llama-Guard-3-8B

Prompt-Guard-86M

With Azure AI, customers receive the benefit of:

Enhanced Security and Compliance: Azure places a strong emphasis on data privacy and security, adopting Microsoft's comprehensive security protocols to protect customer data. With Llama 3.1 405B on Azure AI Foundry, enterprises can operate confidently, knowing their data remains within the secure bounds of the Azure cloud, thereby enhancing privacy and operational efficiency.

Content Safety Integration: Customers can integrate Llama 3.1 405B models with content safety features available through Azure AI Content Safety, enabling additional responsible AI practices. This integration facilitates the development of safer AI applications, ensuring content generated or processed is monitored for adherence to regulatory requirements and guidelines, compliance and ethical standards.

Simplified Assessment of LLM flows: Azure AI's prompt flow allows evaluation flows, which help developers to measure how well the outputs of LLMs match the given standards and goals by computing metrics. This feature is useful for workflows created with Llama 3.1 405B; it enables a comprehensive assessment using metrics such as groundedness, which gauges the pertinence and accuracy of the model's responses based on the input sources when using a retrieval augmented generation (RAG) pattern.

Client integration: You can use the API and key with various clients. Use the provided API in Large Language Model (LLM) tools such as prompt flow, OpenAI, LangChain, LiteLLM, CLI with curl and Python web requests. Deeper integrations and further capabilities coming soon.

Simplified Deployment and Inference: By deploying Meta models through MaaS with pay-as-you-go inference APIs, developers can take advantage of the power of Llama 3.1 405Bwithout managing underlying infrastructure in their Azure environment. You can view the pricing on Azure Marketplace for  Llama 3.1 405B, Llama 3.1 70B, Llama 3.1 8B and fine-tuned models for Llama3.1 8B and 70B based on input and output token consumption.

These features demonstrate Azure's commitment to offering an environment where organizations can harness the full potential of AI technologies like Llama 3.1 efficiently and responsibly, driving innovation while maintaining high standards of security and compliance. 

Getting Started with-Llama-3.1 on Azure AI

To get started and deploy your first model, follow these clear steps: 

Familiarize Yourself: If you're new to Azure AI Foundry, start by reviewing this documentation to understand the basics and set up your first project.

Access the Model Catalog: Open the model catalog in AI Foundry.

Find the Model: Use the filter to select the Meta collection or click the “View models” button on the MaaS announcement card.

Select the Model: Open the Meta-Llama-3.1-405B-Instruct text model from the list.

Deploy the Model: Click on ‘Deploy’ and choose the Pay-as-you-go (PAYG) deployment option.

Subscribe and Access: Subscribe to the offer to gain access to the model (usage charges apply), then proceed to deploy it.

Explore the Playground: After deployment, you will automatically be redirected to the Playground. Here, you can explore the model's capabilities.

Customize Settings: Adjust the context or inference parameters to fine-tune the model's predictions to your needs.

Access Programmatically: Click on the “View code” button to obtain the API, keys, and a code snippet. This enables you to access and integrate the model programmatically. 
Generate Data/Distillation: Use the distillation recipe or data generation recipe to generate data and/or distill models using the deployed models.

Integrate with Tools: Use the provided API in Large Language Model (LLM) tools such as prompt flow, Semantic Kernel, LangChain, or any other tools that support REST API with key-based authentication for making inferences.

Looking Forward

Microsoft’s introduction of the Llama 3.1 405B models underscores our commitment to providing cutting-edge AI models that drive business transformation. By integrating this powerful model into your operations, customers can leverage its advanced capabilities for synthetic data generation and model distillation, producing domain task-specific models for tailored industry use cases.

FAQ 

Cost: What does it cost to use Llama 3.1 405B on Azure?

You are billed based on the number of prompt and completions tokens. You can review the pricing on the Llama 3.1 405B offer in the Azure Marketplace offer details tab when deploying the model. You can also find the pricing on the Azure Marketplace.

Regional availability: Are Llama 3.1 models' region specific on Azure?

Llama 3.1 405B,70B and 8B are available through MaaS as serverless API endpoints.

These endpoints can be created in Azure AI Foundry projects or Azure Machine Learning workspaces. Cross-regional support for these endpoints is available for any region in the US.

Fine-tuning jobs for 8B Instruct and 70B Instruct are available in West US 3.

Please note that if you would like to use any of these three MaaS models in prompt flow within Azure AI Foundry projects or Azure Machine Learning workspaces in other regions, you can use the API endpoint and key as a connection to prompt flow manually. Meaning which, you can use the AI endpoint from any Azure region once it’s been created in East US 2 (for 405B Instruct, 70B Instruct, 8B Instruct) and/or in Sweden Central (70B Instruct, 8B Instruct).

GPU capacity quota: Which models do I require GPU capacity quota in my Azure subscription?

Meta-Llama -3.1-405B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-8B-Instruct are available through MaaS as serverless API endpoints. You don’t require GPU capacity quota in your Azure subscription to deploy these models.

However, if you would like to deploy any of: Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B, Meta-Llama-3.1-8B, Llama-Guard-3-8B and Prompt-Guard-86M, provided you have the relevant associated GPU capacity quota availability as part of a managed compute offering, you will be able to deploy these models.

Azure Marketplace: Llama3.1 405B is listed on the Azure Marketplace. Can I purchase and use Llama 3.1 405B directly from Azure Marketplace?

Azure Marketplace is our foundation for commercial transactions for models built on or built for Azure. The Azure Marketplace enables the purchasing and billing of Llama 3.1 405B. However, model discoverability occurs in both Azure Marketplace and the Azure AI Foundry model catalog. Meaning you can search and find Llama 3.1 405B in both the Azure Marketplace and Azure AI Foundry model catalog.

If you search for Llama 3.1 405B in Azure Marketplace, you can subscribe to the offer before being redirected to the Azure AI Foundry model catalog where you can complete subscribing and can deploy the model.

If you search for Llama 3.1 405B in the Azure AI Foundry model catalog, you can subscribe and deploy the model from the Azure AI Foundry model catalog without starting from the Azure Marketplace. The Azure Marketplace still tracks the underlying commerce flow.

The above is true for Llama 3.1 70B and Llama 3.1 8B as MaaS models, where the commerce flow is supported by Azure Marketplace.

MACC: Given that Llama 3.1 405Bis billed through the Azure Marketplace, does it retire my Azure consumption commitment (aka MACC)?

Yes, Llama 3.1 405B is an “Azure benefit eligible” Marketplace offer, which indicates MACC eligibility. Learn more about MACC here: https://learn.microsoft.com/en-us/marketplace/azure-consumption-commitment-benefit

Data privacy: Is my inference data shared with Meta?

No, Microsoft does not share the content of any inference request or response data with Meta.

Microsoft acts as the data processor for prompts and outputs sent to and generated by a model deployed for pay-as-you-go inferencing (MaaS). Microsoft doesn't share these prompts and outputs with the model provider, and Microsoft doesn't use these prompts and outputs to train or improve Microsoft's, the model providers, or any third party's models.  Read more on data, security and privacy for Models-as-a-Service.

Are there rate limits for the Meta models on Azure?

Meta models come with 400 K tokens per minute and 1 K requests per minute limit. Reach out to Azure customer support if this doesn’t suffice.

Can I use MaaS models in any Azure subscription types?

Customers can use MaaS models in all Azure subsection types with a valid payment method, except for the CSP (Cloud Solution Provider) program. Free or trial Azure subscriptions are not supported.

Can I fine-tune the Llama 3.1 405B model? What about other models?

Not yet for 405B Instruct – stay tuned!

Models available to fine-tune today:

Deployment as serverless API (MaaS): 8B Instruct and 70B Instruct.

Deployment as managed compute: 8B Instruct, 70B Instruct, 8B, 70B.

Please note: This article was edited on Dec 27, 2024 to reflect updated naming for Azure AI Foundry (formerly Azure AI Studio). No other content has been changed. Learn more about Azure AI Foundry.

Updated Dec 27, 2024

Version 5.0

artificial intelligence

azure machine learning

machine learning

model catalog

ashasharma

Microsoft

Joined July 21, 2024

View Profile

Microsoft Foundry Blog

Follow this blog board to get notified when there's new activity